The policy maker of a tourism company named "Visit with us" wants to establish a viable business model to expand the customer base by introducing a new offering of packages. The company is now planning to launch a new product, the Wellness Tourism Package, and is attempting to identify potential customers for it. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle and support or increase one's sense of well-being.
This time, the company wants to harness the available data on existing and potential customers to target the right customers.
# import all the python packages that will be needed. If any package is missing, install it first using this command:
# !pip install <package name>
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
#To install xgboost library use - !pip install xgboost
from xgboost import XGBClassifier
from pandas_profiling import ProfileReport
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
import scipy.stats as stats
from sklearn import tree
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
import kds # https://pypi.org/project/kds/ Key to Data Science for Gain and Decile/ Rank Ordering Table
import xgboost as xgb
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
warnings.filterwarnings("ignore") # During the final run, warnings can be disabled.
# Read in the data into a pandas dataframe
tdata = pd.read_excel('Tourism.xlsx', sheet_name=1)
# Copying the data to another variable to avoid any changes to the original data
data = tdata.copy()
# Let's view the data to ensure it has been read correctly.
data.head()
| | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
data.tail()
| | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4883 | 204883 | 1 | 49.0 | Self Enquiry | 3 | 9.0 | Small Business | Male | 3 | 5.0 | Deluxe | 4.0 | Unmarried | 2.0 | 1 | 1 | 1 | 1.0 | Manager | 26576.0 |
| 4884 | 204884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | Basic | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
| 4885 | 204885 | 1 | 52.0 | Self Enquiry | 3 | 17.0 | Salaried | Female | 4 | 4.0 | Standard | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
| 4886 | 204886 | 1 | 19.0 | Self Enquiry | 3 | 16.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |
| 4887 | 204887 | 1 | 36.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Basic | 4.0 | Unmarried | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 24041.0 |
print('There are', data.shape[0], 'rows and', data.shape[1], 'attributes.')
There are 4888 rows and 20 attributes.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CustomerID                4888 non-null   int64
 1   ProdTaken                 4888 non-null   int64
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object
 4   CityTier                  4888 non-null   int64
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object
 7   Gender                    4888 non-null   object
 8   NumberOfPersonVisiting    4888 non-null   int64
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object
 13  NumberOfTrips             4748 non-null   float64
 14  Passport                  4888 non-null   int64
 15  PitchSatisfactionScore    4888 non-null   int64
 16  OwnCar                    4888 non-null   int64
 17  NumberOfChildrenVisiting  4822 non-null   float64
 18  Designation               4888 non-null   object
 19  MonthlyIncome             4655 non-null   float64
dtypes: float64(7), int64(7), object(6)
memory usage: 763.9+ KB
# Note that the modelling needs numeric attributes. Since we have a mix of numeric and non-numeric attributes, let's look at the latter.
cols = data.select_dtypes(['object'])
cols.columns
Index(['TypeofContact', 'Occupation', 'Gender', 'ProductPitched',
'MaritalStatus', 'Designation'],
dtype='object')
# We will have to deal with the above attributes and convert them to numeric; they appear to be categorical. If they are ordinal, we can map their values to numbers.
# Otherwise, get_dummies() can be used.
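As a quick, hypothetical illustration of both approaches on a toy frame (the Size/Color columns are invented, not from this dataset):

```python
import pandas as pd

toy = pd.DataFrame({"Size": ["Small", "Large", "Medium", "Small"],
                    "Color": ["Red", "Blue", "Red", "Green"]})

# Ordinal attribute: map the levels to numbers that preserve their order.
toy["Size"] = toy["Size"].map({"Small": 1, "Medium": 2, "Large": 3})

# Nominal attribute: one-hot encode; drop_first=True avoids a redundant column.
toy = pd.get_dummies(toy, columns=["Color"], drop_first=True)
print(toy.columns.tolist())  # ['Size', 'Color_Green', 'Color_Red']
```

The same two techniques are applied to the real attributes in the feature-engineering step further below.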
# Let's look at how the numeric attributes look.
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 4888.0 | 202443.500000 | 1411.188388 | 200000.0 | 201221.75 | 202443.5 | 203665.25 | 204887.0 |
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.00 | 36.0 | 44.00 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.00 | 1.0 | 3.00 | 3.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.00 | 13.0 | 20.00 | 127.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.00 | 3.0 | 3.00 | 5.0 |
| NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.00 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.00 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.00 | 3.0 | 4.00 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.00 | 3.0 | 4.00 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | 1.187267 | 0.857861 | 0.0 | 1.00 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.00 | 22347.0 | 25571.00 | 98678.0 |
# Let's use the pandas profile package to create a basic report on the dataset
# The profile report is in a cell with its own scrolling. So scroll within the cell to see the entire report.
profile = ProfileReport(data)
profile
# Save the report file as an HTML for future viewing
profile.to_file("data_profile4.html")
# CustomerID is seen to be a unique attribute above. So it is an ID attribute and is dropped below.
data.drop(['CustomerID'],axis=1,inplace=True)
# Let's see the distribution of values for the attributes which are of object type and are potentially categorical.
cols_cat= data.select_dtypes(['object'])
for i in cols_cat.columns:
    print('Unique values in', i, 'are :')
    print(cols_cat[i].value_counts(dropna=False))
    print('-' * 50)
Unique values in TypeofContact are :
Self Enquiry       3444
Company Invited    1419
NaN                  25
Name: TypeofContact, dtype: int64
--------------------------------------------------
Unique values in Occupation are :
Salaried          2368
Small Business    2084
Large Business     434
Free Lancer          2
Name: Occupation, dtype: int64
--------------------------------------------------
Unique values in Gender are :
Male       2916
Female     1817
Fe Male     155
Name: Gender, dtype: int64
--------------------------------------------------
Unique values in ProductPitched are :
Basic           1842
Deluxe          1732
Standard         742
Super Deluxe     342
King             230
Name: ProductPitched, dtype: int64
--------------------------------------------------
Unique values in MaritalStatus are :
Married      2340
Divorced      950
Single        916
Unmarried     682
Name: MaritalStatus, dtype: int64
--------------------------------------------------
Unique values in Designation are :
Executive         1842
Manager           1732
Senior Manager     742
AVP                342
VP                 230
Name: Designation, dtype: int64
--------------------------------------------------
# The code for these utility functions was shared during the instructor-led sessions. Reusing the code.
# EDA functions
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add the mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add the median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="PuBu",
        order=data[feature].value_counts(ascending=True).index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # sort by the least frequent class (here, the positive class)
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 50)
    tab = (
        pd.crosstab(data[predictor], data[target], margins=True, normalize="index")
        .sort_values(by=sorter, ascending=False)
        .round(4)
        * 100
    )
    print(tab)
    print("-" * 50)
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
# Let's look at histogram and boxplots of numeric columns
cols = data.select_dtypes(['int64','float64']).columns
for col in cols:
    histogram_boxplot(data, col)
# Let's continue the EDA/univariate analysis before we comment on the observations.
# Let's look at the bar plot of counts for the non-numeric attributes.
cols = data.select_dtypes(['object','category']).columns
for col in cols:
    labeled_barplot(data, col, True)
# Let's fix certain data issues before continuing the data analysis. That way we will get a richer analysis.
# Based on the boxplots above, these attributes could do with some outlier treatment. We will cap any high outlier value to Q3 + (1.5 * IQR).
# There are no low outliers for these attributes.
cols = ['DurationOfPitch','NumberOfTrips']
print("Before Outlier treatment")
print(data[cols].describe())
for col in cols:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    data[col] = np.where(data[col] > (Q3 + 1.5 * IQR), Q3 + 1.5 * IQR, data[col])
print("")
print("After Outlier treatment")
print(data[cols].describe())
Before Outlier treatment
DurationOfPitch NumberOfTrips
count 4637.000000 4748.000000
mean 15.490835 3.236521
std 8.519643 1.849019
min 5.000000 1.000000
25% 9.000000 2.000000
50% 13.000000 3.000000
75% 20.000000 4.000000
max 127.000000 22.000000
After Outlier treatment
DurationOfPitch NumberOfTrips
count 4637.000000 4748.000000
mean 15.452016 3.203033
std 8.213213 1.728842
min 5.000000 1.000000
25% 9.000000 2.000000
50% 13.000000 3.000000
75% 20.000000 4.000000
max 36.500000 7.000000
# MonthlyIncome is a highly skewed attribute. Let's look at its log.
print(data["MonthlyIncome"].describe())
log_income = pd.DataFrame(np.log(data["MonthlyIncome"]))
histogram_boxplot(log_income, "MonthlyIncome")
count     4655.000000
mean     23619.853491
std       5380.698361
min       1000.000000
25%      20346.000000
50%      22347.000000
75%      25571.000000
max      98678.000000
Name: MonthlyIncome, dtype: float64
# Let's add the log attribute. There are no 0 or negative income rows, so the log is well-defined.
data["logIncome"] = np.log(data["MonthlyIncome"])
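As an aside, if zero or negative incomes could ever appear, np.log would produce -inf or NaN; np.log1p, which computes log(1 + x), is a common safer alternative for zero-inflated values. A minimal sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

income = pd.Series([0.0, 1000.0, 20993.0])  # hypothetical values, including a zero
safe_log = np.log1p(income)  # log(1 + x): defined at 0, close to log(x) for large x

print(bool(np.isfinite(safe_log).all()))  # True; np.log(0) would give -inf
```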
# We have taken care of the skew. There are still outliers, though, on both the low and high sides. Let's fix them.
cols = ['logIncome']
print("Before Outlier treatment")
print(data[cols].describe())
for col in cols:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    data[col] = np.where(data[col] > (Q3 + 1.5 * IQR), Q3 + 1.5 * IQR, data[col])
    data[col] = np.where(data[col] < (Q1 - 1.5 * IQR), Q1 - 1.5 * IQR, data[col])
print("")
print("After Outlier treatment")
print(data[cols].describe())
Before Outlier treatment
logIncome
count 4655.000000
mean 10.046740
std 0.213016
min 6.907755
25% 9.920640
50% 10.014447
75% 10.149214
max 11.499617
After Outlier treatment
logIncome
count 4655.000000
mean 10.045784
std 0.201834
min 9.577778
25% 9.920640
50% 10.014447
75% 10.149214
max 10.492076
# Let's drop the original Income attribute.
data = data.drop(["MonthlyIncome"], axis=1)
Even though missing values are not as big a problem for decision trees as they are for other methods, we will still make an attempt to deal with them.
# Now, let's handle the missing values. The code below will show the count of missing values.
data.isnull().sum()
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
logIncome                   233
dtype: int64
# Whether we fill a missing value with the median or the mode depends on the attribute's data type and whether it is categorical.
# So let's print the data types.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   int64
 1   Age                       4662 non-null   float64
 2   TypeofContact             4863 non-null   object
 3   CityTier                  4888 non-null   int64
 4   DurationOfPitch           4637 non-null   float64
 5   Occupation                4888 non-null   object
 6   Gender                    4888 non-null   object
 7   NumberOfPersonVisiting    4888 non-null   int64
 8   NumberOfFollowups         4843 non-null   float64
 9   ProductPitched            4888 non-null   object
 10  PreferredPropertyStar     4862 non-null   float64
 11  MaritalStatus             4888 non-null   object
 12  NumberOfTrips             4748 non-null   float64
 13  Passport                  4888 non-null   int64
 14  PitchSatisfactionScore    4888 non-null   int64
 15  OwnCar                    4888 non-null   int64
 16  NumberOfChildrenVisiting  4822 non-null   float64
 17  Designation               4888 non-null   object
 18  logIncome                 4655 non-null   float64
dtypes: float64(7), int64(6), object(6)
memory usage: 725.7+ KB
# For the continuous numeric attributes, we will impute the median. For the categorical and discrete count attributes, we will use the mode.
cols_median = ['Age','DurationOfPitch','logIncome']
for col in cols_median:
    data[col] = data[col].fillna(data[col].median())
cols_mode = ['TypeofContact','NumberOfFollowups','PreferredPropertyStar','NumberOfTrips','NumberOfChildrenVisiting']
for col in cols_mode:
    data[col] = data[col].fillna(data[col].mode()[0])
# Let's check that we fixed the issue
data.isnull().sum()
ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
logIncome                   0
dtype: int64
# Now, let's fix the typo in Gender attribute.
print(data['Gender'].value_counts())
data.loc[data["Gender"] == "Fe Male", "Gender"] = 'Female'
print(data['Gender'].value_counts())
Male       2916
Female     1817
Fe Male     155
Name: Gender, dtype: int64
Male      2916
Female    1972
Name: Gender, dtype: int64
# correlation among attributes
plt.figure(figsize=(20,10))
sns.heatmap(data.corr(), annot=True, fmt=".2f", cmap="PuBuGn")
plt.show()
sns.pairplot(data,hue='ProdTaken')
plt.show()
# Let's see how the distribution of different attributes look for the two different values of the target variable
cols = data[['Age','DurationOfPitch','NumberOfPersonVisiting','NumberOfFollowups']].columns.tolist()
hue_col = 'ProdTaken'
plt.figure(figsize=(10,10))
for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=data[hue_col], y=data[variable], palette="PuBu")
    plt.tight_layout()
    plt.title(variable)
plt.show()
cols = data[['NumberOfTrips','PitchSatisfactionScore','logIncome']].columns.tolist()
hue_col = 'ProdTaken'
plt.figure(figsize=(10,10))
for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=data[hue_col], y=data[variable], palette="PuBu")
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Now let's look at the categorical attributes.
cols = ['TypeofContact','Occupation','Gender','ProductPitched','MaritalStatus','Designation','Passport']
hue_col = 'ProdTaken'
for col in cols:
    stacked_barplot(data, col, hue_col)
ProdTaken           0    1   All
TypeofContact
All              3968  920  4888
Self Enquiry     2859  610  3469
Company Invited  1109  310  1419
--------------------------------------------------
ProdTaken            0      1
TypeofContact
Company Invited  78.15  21.85
All              81.18  18.82
Self Enquiry     82.42  17.58
--------------------------------------------------
ProdTaken          0    1   All
Occupation
All             3968  920  4888
Salaried        1954  414  2368
Small Business  1700  384  2084
Large Business   314  120   434
Free Lancer        0    2     2
--------------------------------------------------
ProdTaken           0       1
Occupation
Free Lancer      0.00  100.00
Large Business  72.35   27.65
All             81.18   18.82
Small Business  81.57   18.43
Salaried        82.52   17.48
--------------------------------------------------
ProdTaken     0    1   All
Gender
All        3968  920  4888
Male       2338  578  2916
Female     1630  342  1972
--------------------------------------------------
ProdTaken      0      1
Gender
Male       80.18  19.82
All        81.18  18.82
Female     82.66  17.34
--------------------------------------------------
ProdTaken          0    1   All
ProductPitched
All             3968  920  4888
Basic           1290  552  1842
Deluxe          1528  204  1732
Standard         618  124   742
King             210   20   230
Super Deluxe     322   20   342
--------------------------------------------------
ProdTaken           0      1
ProductPitched
Basic           70.03  29.97
All             81.18  18.82
Standard        83.29  16.71
Deluxe          88.22  11.78
King            91.30   8.70
Super Deluxe    94.15   5.85
--------------------------------------------------
ProdTaken         0    1   All
MaritalStatus
All            3968  920  4888
Married        2014  326  2340
Single          612  304   916
Unmarried       516  166   682
Divorced        826  124   950
--------------------------------------------------
ProdTaken          0      1
MaritalStatus
Single         66.81  33.19
Unmarried      75.66  24.34
All            81.18  18.82
Married        86.07  13.93
Divorced       86.95  13.05
--------------------------------------------------
ProdTaken          0    1   All
Designation
All             3968  920  4888
Executive       1290  552  1842
Manager         1528  204  1732
Senior Manager   618  124   742
AVP              322   20   342
VP               210   20   230
--------------------------------------------------
ProdTaken           0      1
Designation
Executive       70.03  29.97
All             81.18  18.82
Senior Manager  83.29  16.71
Manager         88.22  11.78
VP              91.30   8.70
AVP             94.15   5.85
--------------------------------------------------
ProdTaken     0    1   All
Passport
All        3968  920  4888
1           928  494  1422
0          3040  426  3466
--------------------------------------------------
ProdTaken      0      1
Passport
1          65.26  34.74
All        81.18  18.82
0          87.71  12.29
--------------------------------------------------
# Now let's break down a few of the plots by an additional attribute.
cols = ['Designation','MaritalStatus','ProductPitched']
ycol = 'Age'
huecol = 'ProdTaken'
for col in cols:
    plt.figure(figsize=(15, 5))
    sns.boxplot(x=data[col], y=data[ycol], hue=data[huecol])
    plt.show()
cols = ['Designation','MaritalStatus','ProductPitched']
ycol = 'logIncome'
huecol = 'ProdTaken'
for col in cols:
    plt.figure(figsize=(15, 5))
    sns.boxplot(x=data[col], y=data[ycol], hue=data[huecol])
    plt.show()
# Let's now look at a few (three) attributes together. This one is a line chart.
plt.figure(figsize=(15, 7))
sns.lineplot(x=data["Age"], y=data["logIncome"], hue=data["ProdTaken"], ci=0)
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
Customers who take up the product (i.e. ProdTaken = 1) with Designation Executive, Senior Manager, and AVP have a lower median age. Those with a designation of VP have a higher median age. For the designation of Manager, there is no significant difference.
Across all Marital statuses, younger customers seem to take the product.
When it comes to ProductPitched, Basic, Standard, and Super Deluxe customers are younger. King customers are older. For Deluxe, there's no significant difference.
For Senior Manager, VP, and AVP designations, those with lower Income take up the product. For Manager, those who have taken the product have a higher median income.
Across all marital statuses, those who have taken the product have lower median income.
When it comes to ProductPitched, Standard, King, and Super Deluxe customers have a lower median income. Among Deluxe customers, those who take the product, have a higher median income. For Basic, there's no significant difference.
Passport = 1 customers have a much higher percentage taking up the new product.
# Let's complete the feature engineering, taking care of both ordinal and other categorical attributes.
replaceStruct = {
"ProductPitched":{"Basic": 1, "Standard": 2 ,"Deluxe": 3 ,"King":4,"Super Deluxe":5},
"Designation": {"Executive": 1, "Manager":2 , "Senior Manager": 3, "AVP": 4,"VP":5}
}
oneHotCols=["TypeofContact","Occupation","Gender","MaritalStatus"]
data=data.replace(replaceStruct)
data=pd.get_dummies(data, columns=oneHotCols,drop_first=True)
data.head(10)
| | ProdTaken | Age | CityTier | DurationOfPitch | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | logIncome | TypeofContact_Self Enquiry | Occupation_Large Business | Occupation_Salaried | Occupation_Small Business | Gender_Male | MaritalStatus_Married | MaritalStatus_Single | MaritalStatus_Unmarried |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 3 | 6.0 | 3 | 3.0 | 3 | 3.0 | 1.0 | 1 | 2 | 1 | 0.0 | 2 | 9.951944 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 49.0 | 1 | 14.0 | 3 | 4.0 | 3 | 4.0 | 2.0 | 0 | 3 | 1 | 2.0 | 2 | 9.909967 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 37.0 | 1 | 8.0 | 3 | 4.0 | 1 | 3.0 | 7.0 | 1 | 3 | 0 | 0.0 | 1 | 9.746249 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 33.0 | 1 | 9.0 | 2 | 3.0 | 1 | 3.0 | 2.0 | 1 | 5 | 1 | 1.0 | 1 | 9.793059 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 36.0 | 1 | 8.0 | 2 | 3.0 | 1 | 4.0 | 1.0 | 0 | 5 | 1 | 0.0 | 1 | 9.823795 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 5 | 0 | 32.0 | 1 | 8.0 | 3 | 3.0 | 1 | 3.0 | 1.0 | 0 | 5 | 1 | 1.0 | 1 | 9.801898 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 6 | 0 | 59.0 | 1 | 9.0 | 2 | 2.0 | 1 | 5.0 | 5.0 | 1 | 2 | 1 | 1.0 | 1 | 9.779624 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 0 | 30.0 | 1 | 30.0 | 3 | 3.0 | 1 | 3.0 | 2.0 | 0 | 2 | 0 | 1.0 | 1 | 9.780924 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 8 | 0 | 38.0 | 1 | 29.0 | 2 | 4.0 | 2 | 3.0 | 1.0 | 0 | 3 | 0 | 0.0 | 3 | 10.107489 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 9 | 0 | 36.0 | 1 | 33.0 | 3 | 3.0 | 3 | 3.0 | 7.0 | 0 | 3 | 1 | 0.0 | 2 | 9.915268 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 23 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   ProdTaken                   4888 non-null   int64
 1   Age                         4888 non-null   float64
 2   CityTier                    4888 non-null   int64
 3   DurationOfPitch             4888 non-null   float64
 4   NumberOfPersonVisiting      4888 non-null   int64
 5   NumberOfFollowups           4888 non-null   float64
 6   ProductPitched              4888 non-null   int64
 7   PreferredPropertyStar       4888 non-null   float64
 8   NumberOfTrips               4888 non-null   float64
 9   Passport                    4888 non-null   int64
 10  PitchSatisfactionScore      4888 non-null   int64
 11  OwnCar                      4888 non-null   int64
 12  NumberOfChildrenVisiting    4888 non-null   float64
 13  Designation                 4888 non-null   int64
 14  logIncome                   4888 non-null   float64
 15  TypeofContact_Self Enquiry  4888 non-null   uint8
 16  Occupation_Large Business   4888 non-null   uint8
 17  Occupation_Salaried         4888 non-null   uint8
 18  Occupation_Small Business   4888 non-null   uint8
 19  Gender_Male                 4888 non-null   uint8
 20  MaritalStatus_Married       4888 non-null   uint8
 21  MaritalStatus_Single        4888 non-null   uint8
 22  MaritalStatus_Unmarried     4888 non-null   uint8
dtypes: float64(7), int64(8), uint8(8)
memory usage: 611.1 KB
Data Description:
Data Cleaning:
Observations from EDA:
Customers for whom the target attribute (ProdTaken) has a value of 1 tend to be: younger, lower in median income, Single or Unmarried, holders of a passport, pitched the Basic product, and at the Executive designation.
Let's get the distribution of the target variable.
data['ProdTaken'].value_counts(normalize=True)
0    0.811784
1    0.188216
Name: ProdTaken, dtype: float64
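Since roughly 81% of customers did not take the product, accuracy alone is a weak yardstick: a trivial model that always predicts 0 scores about 81% accuracy while catching none of the buyers. A small numpy illustration on hypothetical labels with the same split:

```python
import numpy as np

y_true = np.array([0] * 81 + [1] * 19)   # same ~81/19 split as ProdTaken
y_naive = np.zeros_like(y_true)          # majority-class "model": always predict 0

accuracy = (y_naive == y_true).mean()    # fraction of correct predictions
recall = y_naive[y_true == 1].mean()     # fraction of actual buyers identified
print(accuracy, recall)  # 0.81 0.0
```

This is why recall and precision are computed alongside accuracy in the model-evaluation helpers below.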
# Split the dataset into training and test subsets.
X = data.drop(['ProdTaken'],axis=1)
y = data['ProdTaken']
# Splitting data into training and test set:
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1,stratify=y)
print(X_train.shape, X_test.shape)
(3421, 22) (1467, 22)
# Print the target variable's distribution in the training and test datasets.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
0    0.811751
1    0.188249
Name: ProdTaken, dtype: float64
0    0.811861
1    0.188139
Name: ProdTaken, dtype: float64
# The code is reused from the weekend instructor-led sessions.
# Let's define functions to provide metric scores (accuracy, recall, and precision) on the train and test sets, and a function to show the confusion matrix, so that we do not
# have to use the same code repetitively while evaluating models.
## Function to calculate recall score
def get_recall_score(model, flag=True):
    '''
    model : classifier to predict values of X
    '''
    a = []  # defining an empty list to store train and test results
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    a.append(train_recall)  # adding train recall to the list
    a.append(test_recall)  # adding test recall to the list
    if flag:  # the following print statements are displayed only when the flag is True
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
    return a  # returning the list with train and test scores
## Function to calculate precision score
def get_precision_score(model, flag=True):
    '''
    model : classifier to predict values of X
    '''
    b = []  # defining an empty list to store train and test results
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    b.append(train_precision)  # adding train precision to the list
    b.append(test_precision)  # adding test precision to the list
    if flag:  # the following print statements are displayed only when the flag is True
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return b  # returning the list with train and test scores
## Function to calculate accuracy score
def get_accuracy_score(model, flag=True):
    '''
    model : classifier to predict values of X
    '''
    c = []  # defining an empty list to store train and test results
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    c.append(train_acc)  # adding train accuracy to the list
    c.append(test_acc)  # adding test accuracy to the list
    if flag:  # the following print statements are displayed only when the flag is True
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
    return c  # returning the list with train and test scores
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
## Function to create confusion matrix
def make_confusion_matrix(model,y_actual,labels=[0, 1]):
'''
model : classifier used to predict on X_test
y_actual : ground truth for X_test
labels : class labels in the order [negative, positive]
'''
y_predict = model.predict(X_test)
cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
columns = [i for i in ['Predicted - No','Predicted - Yes']])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
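Both plotting helpers annotate each cell with the raw count stacked over its share of all predictions. A minimal, self-contained check of that labeling logic, using a hypothetical 2x2 confusion matrix rather than the notebook's data:

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted.
cm = np.array([[50, 10],
               [5, 35]])

# Build the same count-plus-percentage labels the helper functions create.
labels = np.asarray(
    ["{0:0.0f}".format(v) + "\n{0:.2%}".format(v / cm.sum()) for v in cm.flatten()]
).reshape(2, 2)
print(labels)  # each cell: count over its share of all 100 predictions
```

The percentage is taken over the whole matrix (all predictions), not per row, so the four cells always sum to 100%.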
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model,flag=True):
'''
model : classifier to predict values of X
'''
# defining an empty list to store train and test results
score_list=[]
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
train_acc = model.score(X_train,y_train)
test_acc = model.score(X_test,y_test)
train_recall = metrics.recall_score(y_train,pred_train)
test_recall = metrics.recall_score(y_test,pred_test)
train_precision = metrics.precision_score(y_train,pred_train)
test_precision = metrics.precision_score(y_test,pred_test)
score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision))
# If the flag is set to True, the following print statements will be displayed. The default value is True.
if flag == True:
print("Accuracy on training set : ",train_acc)
print("Accuracy on test set : ",test_acc)
print("Recall on training set : ",train_recall)
print("Recall on test set : ",test_recall)
print("Precision on training set : ",train_precision)
print("Precision on test set : ",test_precision)
return score_list # returning the list with train and test scores
dtree = DecisionTreeClassifier(criterion='gini',class_weight={0:0.15,1:0.85},random_state=1)
dtree.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn(dtree, X_test, y_test)
dtree_acc = get_accuracy_score(dtree)
dtree_recall = get_recall_score(dtree)
dtree_precision = get_precision_score(dtree)
Accuracy on training set : 1.0
Accuracy on test set : 0.8718473074301295
Recall on training set : 1.0
Recall on test set : 0.677536231884058
Precision on training set : 1.0
Precision on test set : 0.6538461538461539
kds.metrics.plot_cumulative_gain(y_test, dtree.predict(X_test))
kds.metrics.decile_table(y_test, dtree.predict(X_test))
LABELS INFO:
prob_min : Minimum probability in a particular decile
prob_max : Maximum probability in a particular decile
prob_avg : Average probability in a particular decile
cnt_events : Count of events in a particular decile
cnt_resp : Count of responders in a particular decile
cnt_non_resp : Count of non-responders in a particular decile
cnt_resp_rndm : Count of responders if events assigned randomly in a particular decile
cnt_resp_wiz : Count of best possible responders in a particular decile
resp_rate : Response Rate in a particular decile [(cnt_resp/cnt_cust)*100]
cum_events : Cumulative sum of events decile-wise
cum_resp : Cumulative sum of responders decile-wise
cum_resp_wiz : Cumulative sum of best possible responders decile-wise
cum_non_resp : Cumulative sum of non-responders decile-wise
cum_events_pct : Cumulative sum of percentages of events decile-wise
cum_resp_pct : Cumulative sum of percentages of responders decile-wise
cum_resp_pct_wiz : Cumulative sum of percentages of best possible responders decile-wise
cum_non_resp_pct : Cumulative sum of percentages of non-responders decile-wise
KS : KS Statistic decile-wise
lift : Cumulative Lift Value decile-wise
| | decile | prob_min | prob_max | prob_avg | cnt_cust | cnt_resp | cnt_non_resp | cnt_resp_rndm | cnt_resp_wiz | resp_rate | cum_cust | cum_resp | cum_resp_wiz | cum_non_resp | cum_cust_pct | cum_resp_pct | cum_resp_pct_wiz | cum_non_resp_pct | KS | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 1.000 | 147.0 | 98.0 | 49.0 | 27.6 | 147 | 66.667 | 147.0 | 98.0 | 147 | 49.0 | 10.020 | 35.507 | 53.261 | 4.114 | 31.393 | 3.544 |
| 1 | 2 | 0.0 | 1.0 | 0.946 | 147.0 | 89.0 | 58.0 | 27.6 | 129 | 60.544 | 294.0 | 187.0 | 276 | 107.0 | 20.041 | 67.754 | 100.000 | 8.984 | 58.770 | 3.381 |
| 2 | 3 | 0.0 | 0.0 | 0.000 | 147.0 | 11.0 | 136.0 | 27.6 | 0 | 7.483 | 441.0 | 198.0 | 276 | 243.0 | 30.061 | 71.739 | 100.000 | 20.403 | 51.336 | 2.386 |
| 3 | 4 | 0.0 | 0.0 | 0.000 | 146.0 | 5.0 | 141.0 | 27.6 | 0 | 3.425 | 587.0 | 203.0 | 276 | 384.0 | 40.014 | 73.551 | 100.000 | 32.242 | 41.309 | 1.838 |
| 4 | 5 | 0.0 | 0.0 | 0.000 | 147.0 | 16.0 | 131.0 | 27.6 | 0 | 10.884 | 734.0 | 219.0 | 276 | 515.0 | 50.034 | 79.348 | 100.000 | 43.241 | 36.107 | 1.586 |
| 5 | 6 | 0.0 | 0.0 | 0.000 | 147.0 | 8.0 | 139.0 | 27.6 | 0 | 5.442 | 881.0 | 227.0 | 276 | 654.0 | 60.055 | 82.246 | 100.000 | 54.912 | 27.334 | 1.370 |
| 6 | 7 | 0.0 | 0.0 | 0.000 | 146.0 | 13.0 | 133.0 | 27.6 | 0 | 8.904 | 1027.0 | 240.0 | 276 | 787.0 | 70.007 | 86.957 | 100.000 | 66.079 | 20.878 | 1.242 |
| 7 | 8 | 0.0 | 0.0 | 0.000 | 147.0 | 13.0 | 134.0 | 27.6 | 0 | 8.844 | 1174.0 | 253.0 | 276 | 921.0 | 80.027 | 91.667 | 100.000 | 77.330 | 14.337 | 1.145 |
| 8 | 9 | 0.0 | 0.0 | 0.000 | 147.0 | 11.0 | 136.0 | 27.6 | 0 | 7.483 | 1321.0 | 264.0 | 276 | 1057.0 | 90.048 | 95.652 | 100.000 | 88.749 | 6.903 | 1.062 |
| 9 | 10 | 0.0 | 0.0 | 0.000 | 146.0 | 12.0 | 134.0 | 27.6 | 0 | 8.219 | 1467.0 | 276.0 | 276 | 1191.0 | 100.000 | 100.000 | 100.000 | 100.000 | 0.000 | 1.000 |
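The KS and lift columns of these decile tables follow directly from the cumulative percentage columns. A minimal sketch of the derivation, using hypothetical per-decile counts rather than the notebook's data:

```python
import pandas as pd

# Hypothetical per-decile counts (three deciles only, for illustration).
dt = pd.DataFrame({"cnt_cust": [147, 147, 146],
                   "cnt_resp": [98, 89, 12]})
dt["cnt_non_resp"] = dt["cnt_cust"] - dt["cnt_resp"]

# Cumulative percentages, decile-wise.
dt["cum_cust_pct"] = 100 * dt["cnt_cust"].cumsum() / dt["cnt_cust"].sum()
dt["cum_resp_pct"] = 100 * dt["cnt_resp"].cumsum() / dt["cnt_resp"].sum()
dt["cum_non_resp_pct"] = 100 * dt["cnt_non_resp"].cumsum() / dt["cnt_non_resp"].sum()

# KS: spread between cumulative responder % and cumulative non-responder %.
dt["KS"] = dt["cum_resp_pct"] - dt["cum_non_resp_pct"]
# lift: cumulative responder % relative to the random-targeting baseline.
dt["lift"] = dt["cum_resp_pct"] / dt["cum_cust_pct"]
print(dt[["KS", "lift"]].round(3))
```

By construction, KS falls to 0 and lift falls to 1.0 in the last decile, since both cumulative percentages reach 100% there; a useful model shows high KS and lift in the top deciles.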
bagging = BaggingClassifier(random_state=1)
bagging.fit(X_train,y_train)
BaggingClassifier(random_state=1)
confusion_matrix_sklearn(bagging, X_test, y_test)
bagging_acc = get_accuracy_score(bagging)
bagging_recall = get_recall_score(bagging)
bagging_precision = get_precision_score(bagging)
Accuracy on training set : 0.9947383805904706
Accuracy on test set : 0.9059304703476483
Recall on training set : 0.9736024844720497
Recall on test set : 0.5942028985507246
Precision on training set : 0.9984076433121019
Precision on test set : 0.8631578947368421
kds.metrics.plot_cumulative_gain(y_test, bagging.predict(X_test))
kds.metrics.decile_table(y_test, bagging.predict(X_test))
| | decile | prob_min | prob_max | prob_avg | cnt_cust | cnt_resp | cnt_non_resp | cnt_resp_rndm | cnt_resp_wiz | resp_rate | cum_cust | cum_resp | cum_resp_wiz | cum_non_resp | cum_cust_pct | cum_resp_pct | cum_resp_pct_wiz | cum_non_resp_pct | KS | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 1.000 | 147.0 | 126.0 | 21.0 | 27.6 | 147 | 85.714 | 147.0 | 126.0 | 147 | 21.0 | 10.020 | 45.652 | 53.261 | 1.763 | 43.889 | 4.556 |
| 1 | 2 | 0.0 | 1.0 | 0.293 | 147.0 | 52.0 | 95.0 | 27.6 | 129 | 35.374 | 294.0 | 178.0 | 276 | 116.0 | 20.041 | 64.493 | 100.000 | 9.740 | 54.753 | 3.218 |
| 2 | 3 | 0.0 | 0.0 | 0.000 | 147.0 | 9.0 | 138.0 | 27.6 | 0 | 6.122 | 441.0 | 187.0 | 276 | 254.0 | 30.061 | 67.754 | 100.000 | 21.327 | 46.427 | 2.254 |
| 3 | 4 | 0.0 | 0.0 | 0.000 | 146.0 | 20.0 | 126.0 | 27.6 | 0 | 13.699 | 587.0 | 207.0 | 276 | 380.0 | 40.014 | 75.000 | 100.000 | 31.906 | 43.094 | 1.874 |
| 4 | 5 | 0.0 | 0.0 | 0.000 | 147.0 | 6.0 | 141.0 | 27.6 | 0 | 4.082 | 734.0 | 213.0 | 276 | 521.0 | 50.034 | 77.174 | 100.000 | 43.745 | 33.429 | 1.542 |
| 5 | 6 | 0.0 | 0.0 | 0.000 | 147.0 | 8.0 | 139.0 | 27.6 | 0 | 5.442 | 881.0 | 221.0 | 276 | 660.0 | 60.055 | 80.072 | 100.000 | 55.416 | 24.656 | 1.333 |
| 6 | 7 | 0.0 | 0.0 | 0.000 | 146.0 | 18.0 | 128.0 | 27.6 | 0 | 12.329 | 1027.0 | 239.0 | 276 | 788.0 | 70.007 | 86.594 | 100.000 | 66.163 | 20.431 | 1.237 |
| 7 | 8 | 0.0 | 0.0 | 0.000 | 147.0 | 11.0 | 136.0 | 27.6 | 0 | 7.483 | 1174.0 | 250.0 | 276 | 924.0 | 80.027 | 90.580 | 100.000 | 77.582 | 12.998 | 1.132 |
| 8 | 9 | 0.0 | 0.0 | 0.000 | 147.0 | 14.0 | 133.0 | 27.6 | 0 | 9.524 | 1321.0 | 264.0 | 276 | 1057.0 | 90.048 | 95.652 | 100.000 | 88.749 | 6.903 | 1.062 |
| 9 | 10 | 0.0 | 0.0 | 0.000 | 146.0 | 12.0 | 134.0 | 27.6 | 0 | 8.219 | 1467.0 | 276.0 | 276 | 1191.0 | 100.000 | 100.000 | 100.000 | 100.000 | 0.000 | 1.000 |
Bagging Classifier with a weighted decision tree
bagging_wt = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='gini',class_weight={0:0.15,1:0.85},random_state=1),random_state=1)
bagging_wt.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.15,
1: 0.85},
random_state=1),
random_state=1)
confusion_matrix_sklearn(bagging_wt, X_test, y_test)
wt_bagging_acc = get_accuracy_score(bagging_wt)
wt_bagging_recall = get_recall_score(bagging_wt)
wt_bagging_precision = get_precision_score(bagging_wt)
Accuracy on training set : 0.9944460684010523
Accuracy on test set : 0.9032038173142468
Recall on training set : 0.9736024844720497
Recall on test set : 0.5471014492753623
Precision on training set : 0.9968203497615262
Precision on test set : 0.8988095238095238
kds.metrics.plot_cumulative_gain(y_test, bagging_wt.predict(X_test))
kds.metrics.decile_table(y_test, bagging_wt.predict(X_test))
| | decile | prob_min | prob_max | prob_avg | cnt_cust | cnt_resp | cnt_non_resp | cnt_resp_rndm | cnt_resp_wiz | resp_rate | cum_cust | cum_resp | cum_resp_wiz | cum_non_resp | cum_cust_pct | cum_resp_pct | cum_resp_pct_wiz | cum_non_resp_pct | KS | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 1.000 | 147.0 | 131.0 | 16.0 | 27.6 | 147 | 89.116 | 147.0 | 131.0 | 147 | 16.0 | 10.020 | 47.464 | 53.261 | 1.343 | 46.121 | 4.737 |
| 1 | 2 | 0.0 | 1.0 | 0.143 | 147.0 | 35.0 | 112.0 | 27.6 | 129 | 23.810 | 294.0 | 166.0 | 276 | 128.0 | 20.041 | 60.145 | 100.000 | 10.747 | 49.398 | 3.001 |
| 2 | 3 | 0.0 | 0.0 | 0.000 | 147.0 | 13.0 | 134.0 | 27.6 | 0 | 8.844 | 441.0 | 179.0 | 276 | 262.0 | 30.061 | 64.855 | 100.000 | 21.998 | 42.857 | 2.157 |
| 3 | 4 | 0.0 | 0.0 | 0.000 | 146.0 | 14.0 | 132.0 | 27.6 | 0 | 9.589 | 587.0 | 193.0 | 276 | 394.0 | 40.014 | 69.928 | 100.000 | 33.081 | 36.847 | 1.748 |
| 4 | 5 | 0.0 | 0.0 | 0.000 | 147.0 | 12.0 | 135.0 | 27.6 | 0 | 8.163 | 734.0 | 205.0 | 276 | 529.0 | 50.034 | 74.275 | 100.000 | 44.416 | 29.859 | 1.484 |
| 5 | 6 | 0.0 | 0.0 | 0.000 | 147.0 | 11.0 | 136.0 | 27.6 | 0 | 7.483 | 881.0 | 216.0 | 276 | 665.0 | 60.055 | 78.261 | 100.000 | 55.835 | 22.426 | 1.303 |
| 6 | 7 | 0.0 | 0.0 | 0.000 | 146.0 | 19.0 | 127.0 | 27.6 | 0 | 13.014 | 1027.0 | 235.0 | 276 | 792.0 | 70.007 | 85.145 | 100.000 | 66.499 | 18.646 | 1.216 |
| 7 | 8 | 0.0 | 0.0 | 0.000 | 147.0 | 14.0 | 133.0 | 27.6 | 0 | 9.524 | 1174.0 | 249.0 | 276 | 925.0 | 80.027 | 90.217 | 100.000 | 77.666 | 12.551 | 1.127 |
| 8 | 9 | 0.0 | 0.0 | 0.000 | 147.0 | 15.0 | 132.0 | 27.6 | 0 | 10.204 | 1321.0 | 264.0 | 276 | 1057.0 | 90.048 | 95.652 | 100.000 | 88.749 | 6.903 | 1.062 |
| 9 | 10 | 0.0 | 0.0 | 0.000 | 146.0 | 12.0 | 134.0 | 27.6 | 0 | 8.219 | 1467.0 | 276.0 | 276 | 1191.0 | 100.000 | 100.000 | 100.000 | 100.000 | 0.000 | 1.000 |
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
confusion_matrix_sklearn(rf, X_test, y_test)
rf_acc = get_accuracy_score(rf)
rf_recall = get_recall_score(rf)
rf_precision = get_precision_score(rf)
Accuracy on training set : 1.0
Accuracy on test set : 0.9147920927062031
Recall on training set : 1.0
Recall on test set : 0.5833333333333334
Precision on training set : 1.0
Precision on test set : 0.9415204678362573
kds.metrics.plot_cumulative_gain(y_test, rf.predict(X_test))
kds.metrics.decile_table(y_test, rf.predict(X_test))
| | decile | prob_min | prob_max | prob_avg | cnt_cust | cnt_resp | cnt_non_resp | cnt_resp_rndm | cnt_resp_wiz | resp_rate | cum_cust | cum_resp | cum_resp_wiz | cum_non_resp | cum_cust_pct | cum_resp_pct | cum_resp_pct_wiz | cum_non_resp_pct | KS | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 1.000 | 147.0 | 138.0 | 9.0 | 27.6 | 147 | 93.878 | 147.0 | 138.0 | 147 | 9.0 | 10.020 | 50.000 | 53.261 | 0.756 | 49.244 | 4.990 |
| 1 | 2 | 0.0 | 1.0 | 0.163 | 147.0 | 35.0 | 112.0 | 27.6 | 129 | 23.810 | 294.0 | 173.0 | 276 | 121.0 | 20.041 | 62.681 | 100.000 | 10.160 | 52.521 | 3.128 |
| 2 | 3 | 0.0 | 0.0 | 0.000 | 147.0 | 15.0 | 132.0 | 27.6 | 0 | 10.204 | 441.0 | 188.0 | 276 | 253.0 | 30.061 | 68.116 | 100.000 | 21.243 | 46.873 | 2.266 |
| 3 | 4 | 0.0 | 0.0 | 0.000 | 146.0 | 13.0 | 133.0 | 27.6 | 0 | 8.904 | 587.0 | 201.0 | 276 | 386.0 | 40.014 | 72.826 | 100.000 | 32.410 | 40.416 | 1.820 |
| 4 | 5 | 0.0 | 0.0 | 0.000 | 147.0 | 13.0 | 134.0 | 27.6 | 0 | 8.844 | 734.0 | 214.0 | 276 | 520.0 | 50.034 | 77.536 | 100.000 | 43.661 | 33.875 | 1.550 |
| 5 | 6 | 0.0 | 0.0 | 0.000 | 147.0 | 11.0 | 136.0 | 27.6 | 0 | 7.483 | 881.0 | 225.0 | 276 | 656.0 | 60.055 | 81.522 | 100.000 | 55.080 | 26.442 | 1.357 |
| 6 | 7 | 0.0 | 0.0 | 0.000 | 146.0 | 13.0 | 133.0 | 27.6 | 0 | 8.904 | 1027.0 | 238.0 | 276 | 789.0 | 70.007 | 86.232 | 100.000 | 66.247 | 19.985 | 1.232 |
| 7 | 8 | 0.0 | 0.0 | 0.000 | 147.0 | 13.0 | 134.0 | 27.6 | 0 | 8.844 | 1174.0 | 251.0 | 276 | 923.0 | 80.027 | 90.942 | 100.000 | 77.498 | 13.444 | 1.136 |
| 8 | 9 | 0.0 | 0.0 | 0.000 | 147.0 | 16.0 | 131.0 | 27.6 | 0 | 10.884 | 1321.0 | 267.0 | 276 | 1054.0 | 90.048 | 96.739 | 100.000 | 88.497 | 8.242 | 1.074 |
| 9 | 10 | 0.0 | 0.0 | 0.000 | 146.0 | 9.0 | 137.0 | 27.6 | 0 | 6.164 | 1467.0 | 276.0 | 276 | 1191.0 | 100.000 | 100.000 | 100.000 | 100.000 | 0.000 | 1.000 |
Random forest with class weights
rf_wt = RandomForestClassifier(class_weight={0:0.15,1:0.85}, random_state=1)
rf_wt.fit(X_train,y_train)
RandomForestClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn(rf_wt, X_test, y_test)
wt_rf_acc = get_accuracy_score(rf_wt)
wt_rf_recall = get_recall_score(rf_wt)
wt_rf_precision = get_precision_score(rf_wt)
Accuracy on training set : 1.0
Accuracy on test set : 0.9052488070892979
Recall on training set : 1.0
Recall on test set : 0.5289855072463768
Precision on training set : 1.0
Precision on test set : 0.9419354838709677
kds.metrics.plot_cumulative_gain(y_test, rf_wt.predict(X_test))
kds.metrics.decile_table(y_test, rf_wt.predict(X_test))
| | decile | prob_min | prob_max | prob_avg | cnt_cust | cnt_resp | cnt_non_resp | cnt_resp_rndm | cnt_resp_wiz | resp_rate | cum_cust | cum_resp | cum_resp_wiz | cum_non_resp | cum_cust_pct | cum_resp_pct | cum_resp_pct_wiz | cum_non_resp_pct | KS | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 1.000 | 147.0 | 138.0 | 9.0 | 27.6 | 147 | 93.878 | 147.0 | 138.0 | 147 | 9.0 | 10.020 | 50.000 | 53.261 | 0.756 | 49.244 | 4.990 |
| 1 | 2 | 0.0 | 1.0 | 0.054 | 147.0 | 23.0 | 124.0 | 27.6 | 129 | 15.646 | 294.0 | 161.0 | 276 | 133.0 | 20.041 | 58.333 | 100.000 | 11.167 | 47.166 | 2.911 |
| 2 | 3 | 0.0 | 0.0 | 0.000 | 147.0 | 12.0 | 135.0 | 27.6 | 0 | 8.163 | 441.0 | 173.0 | 276 | 268.0 | 30.061 | 62.681 | 100.000 | 22.502 | 40.179 | 2.085 |
| 3 | 4 | 0.0 | 0.0 | 0.000 | 146.0 | 22.0 | 124.0 | 27.6 | 0 | 15.068 | 587.0 | 195.0 | 276 | 392.0 | 40.014 | 70.652 | 100.000 | 32.914 | 37.738 | 1.766 |
| 4 | 5 | 0.0 | 0.0 | 0.000 | 147.0 | 7.0 | 140.0 | 27.6 | 0 | 4.762 | 734.0 | 202.0 | 276 | 532.0 | 50.034 | 73.188 | 100.000 | 44.668 | 28.520 | 1.463 |
| 5 | 6 | 0.0 | 0.0 | 0.000 | 147.0 | 13.0 | 134.0 | 27.6 | 0 | 8.844 | 881.0 | 215.0 | 276 | 666.0 | 60.055 | 77.899 | 100.000 | 55.919 | 21.980 | 1.297 |
| 6 | 7 | 0.0 | 0.0 | 0.000 | 146.0 | 16.0 | 130.0 | 27.6 | 0 | 10.959 | 1027.0 | 231.0 | 276 | 796.0 | 70.007 | 83.696 | 100.000 | 66.835 | 16.861 | 1.196 |
| 7 | 8 | 0.0 | 0.0 | 0.000 | 147.0 | 13.0 | 134.0 | 27.6 | 0 | 8.844 | 1174.0 | 244.0 | 276 | 930.0 | 80.027 | 88.406 | 100.000 | 78.086 | 10.320 | 1.105 |
| 8 | 9 | 0.0 | 0.0 | 0.000 | 147.0 | 21.0 | 126.0 | 27.6 | 0 | 14.286 | 1321.0 | 265.0 | 276 | 1056.0 | 90.048 | 96.014 | 100.000 | 88.665 | 7.349 | 1.066 |
| 9 | 10 | 0.0 | 0.0 | 0.000 | 146.0 | 11.0 | 135.0 | 27.6 | 0 | 7.534 | 1467.0 | 276.0 | 276 | 1191.0 | 100.000 | 100.000 | 100.000 | 100.000 | 0.000 | 1.000 |
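The class_weight={0:0.15, 1:0.85} setting used for the weighted models above makes each positive (package-buying) sample count roughly 5.7x as much as a negative one during training. In scikit-learn trees this is equivalent to passing per-sample weights; a minimal, self-contained check on hypothetical toy data (not the notebook's dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data, roughly 80% negatives.
X_toy, y_toy = make_classification(n_samples=500, weights=[0.8], random_state=1)

# Fit once with class_weight, once with the equivalent explicit sample weights.
tree_cw = DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85},
                                 random_state=1).fit(X_toy, y_toy)
tree_sw = DecisionTreeClassifier(random_state=1).fit(
    X_toy, y_toy, sample_weight=np.where(y_toy == 1, 0.85, 0.15))

# Both trees are built from identical weighted impurity calculations.
print(np.array_equal(tree_cw.predict(X_toy), tree_sw.predict(X_toy)))  # prints True
```

Up-weighting the rare class shifts leaf predictions toward class 1, which is why the weighted models trade some precision for recall on the minority class.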
Tuning Decision Tree
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight={0:0.15,1:0.85},random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2,30), # depth of the tree
'min_samples_leaf': [1, 2, 5, 7, 10], # The minimum number of samples required to be at a leaf node.
'max_leaf_nodes' : [2, 3, 5, 10,15], # Max number of leaf nodes. Grow a tree in best first fashion, in terms of impurity reduction
'min_impurity_decrease': [0.0001,0.001,0.01,0.1] # A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, max_depth=2,
max_leaf_nodes=2, min_impurity_decrease=0.1,
random_state=1)
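The recall scorer built with metrics.make_scorer above is interchangeable with the predefined scoring='recall' string that the later grid searches use. A quick self-contained check on toy data (hypothetical, not the notebook's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

X_toy, y_toy = make_classification(n_samples=300, random_state=1)
params = {"max_depth": [2, 4]}

# Same search run with the make_scorer-based scorer and the built-in string.
grid_a = GridSearchCV(DecisionTreeClassifier(random_state=1), params,
                      scoring=metrics.make_scorer(metrics.recall_score),
                      cv=3).fit(X_toy, y_toy)
grid_b = GridSearchCV(DecisionTreeClassifier(random_state=1), params,
                      scoring="recall", cv=3).fit(X_toy, y_toy)

print(grid_a.best_params_ == grid_b.best_params_,
      grid_a.best_score_ == grid_b.best_score_)  # prints True True
```

Note that optimizing recall alone can reward a degenerate model that predicts every customer as a buyer (recall 1.0 at very low precision), which is worth keeping in mind when reading the tuned results.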
confusion_matrix_sklearn(dtree_estimator, X_test, y_test)
tuned_dtree_acc = get_accuracy_score(dtree_estimator)
tuned_dtree_recall = get_recall_score(dtree_estimator)
tuned_dtree_precision = get_precision_score(dtree_estimator)
Accuracy on training set : 0.1882490499853844
Accuracy on test set : 0.18813905930470348
Recall on training set : 1.0
Recall on test set : 1.0
Precision on training set : 0.1882490499853844
Precision on test set : 0.18813905930470348
kds.metrics.plot_cumulative_gain(y_test, dtree_estimator.predict(X_test))
kds.metrics.decile_table(y_test, dtree_estimator.predict(X_test))
| | decile | prob_min | prob_max | prob_avg | cnt_cust | cnt_resp | cnt_non_resp | cnt_resp_rndm | cnt_resp_wiz | resp_rate | cum_cust | cum_resp | cum_resp_wiz | cum_non_resp | cum_cust_pct | cum_resp_pct | cum_resp_pct_wiz | cum_non_resp_pct | KS | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 1.0 | 147.0 | 24.0 | 123.0 | 27.6 | 147 | 16.327 | 147.0 | 24.0 | 147 | 123.0 | 10.020 | 8.696 | 53.261 | 10.327 | -1.631 | 0.868 |
| 1 | 2 | 1.0 | 1.0 | 1.0 | 147.0 | 19.0 | 128.0 | 27.6 | 129 | 12.925 | 294.0 | 43.0 | 276 | 251.0 | 20.041 | 15.580 | 100.000 | 21.075 | -5.495 | 0.777 |
| 2 | 3 | 1.0 | 1.0 | 1.0 | 147.0 | 30.0 | 117.0 | 27.6 | 0 | 20.408 | 441.0 | 73.0 | 276 | 368.0 | 30.061 | 26.449 | 100.000 | 30.898 | -4.449 | 0.880 |
| 3 | 4 | 1.0 | 1.0 | 1.0 | 146.0 | 29.0 | 117.0 | 27.6 | 0 | 19.863 | 587.0 | 102.0 | 276 | 485.0 | 40.014 | 36.957 | 100.000 | 40.722 | -3.765 | 0.924 |
| 4 | 5 | 1.0 | 1.0 | 1.0 | 147.0 | 26.0 | 121.0 | 27.6 | 0 | 17.687 | 734.0 | 128.0 | 276 | 606.0 | 50.034 | 46.377 | 100.000 | 50.882 | -4.505 | 0.927 |
| 5 | 6 | 1.0 | 1.0 | 1.0 | 147.0 | 34.0 | 113.0 | 27.6 | 0 | 23.129 | 881.0 | 162.0 | 276 | 719.0 | 60.055 | 58.696 | 100.000 | 60.369 | -1.673 | 0.977 |
| 6 | 7 | 1.0 | 1.0 | 1.0 | 146.0 | 20.0 | 126.0 | 27.6 | 0 | 13.699 | 1027.0 | 182.0 | 276 | 845.0 | 70.007 | 65.942 | 100.000 | 70.949 | -5.007 | 0.942 |
| 7 | 8 | 1.0 | 1.0 | 1.0 | 147.0 | 27.0 | 120.0 | 27.6 | 0 | 18.367 | 1174.0 | 209.0 | 276 | 965.0 | 80.027 | 75.725 | 100.000 | 81.024 | -5.299 | 0.946 |
| 8 | 9 | 1.0 | 1.0 | 1.0 | 147.0 | 32.0 | 115.0 | 27.6 | 0 | 21.769 | 1321.0 | 241.0 | 276 | 1080.0 | 90.048 | 87.319 | 100.000 | 90.680 | -3.361 | 0.970 |
| 9 | 10 | 1.0 | 1.0 | 1.0 | 146.0 | 35.0 | 111.0 | 27.6 | 0 | 23.973 | 1467.0 | 276.0 | 276 | 1191.0 | 100.000 | 100.000 | 100.000 | 100.000 | 0.000 | 1.000 |
Tuning Bagging Classifier
# grid search for bagging classifier
cl1 = DecisionTreeClassifier(class_weight={0:0.15,1:0.85},random_state=1)
param_grid = {'base_estimator':[cl1], # The estimator to bag, the default is Decision Tree
'n_estimators':[5,7,15,51,101], # Number of estimators in the bagged classifier
'max_features': [0.7,0.8,0.9,1] # The max number of features available to split at each node. default= 1.0
}
grid = GridSearchCV(BaggingClassifier(random_state=1,bootstrap=True), param_grid=param_grid, scoring = 'recall', cv = 5)
grid.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=1),
param_grid={'base_estimator': [DecisionTreeClassifier(class_weight={0: 0.15,
1: 0.85},
random_state=1)],
'max_features': [0.7, 0.8, 0.9, 1],
'n_estimators': [5, 7, 15, 51, 101]},
scoring='recall')
## getting the best estimator
bagging_estimator = grid.best_estimator_
bagging_estimator.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.15,
1: 0.85},
random_state=1),
max_features=1, n_estimators=101, random_state=1)
confusion_matrix_sklearn(bagging_estimator, X_test, y_test)
tuned_bagging_acc= get_accuracy_score(bagging_estimator)
tuned_bagging_recall = get_recall_score(bagging_estimator)
tuned_bagging_precision = get_precision_score(bagging_estimator)
Accuracy on training set : 0.2098801520023385
Accuracy on test set : 0.2010906612133606
Recall on training set : 1.0
Recall on test set : 1.0
Precision on training set : 0.19241111443083359
Precision on test set : 0.19060773480662985
kds.metrics.plot_cumulative_gain(y_test, bagging_estimator.predict(X_test))
kds.metrics.decile_table(y_test, bagging_estimator.predict(X_test))
| | decile | prob_min | prob_max | prob_avg | cnt_cust | cnt_resp | cnt_non_resp | cnt_resp_rndm | cnt_resp_wiz | resp_rate | cum_cust | cum_resp | cum_resp_wiz | cum_non_resp | cum_cust_pct | cum_resp_pct | cum_resp_pct_wiz | cum_non_resp_pct | KS | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 1.00 | 147.0 | 26.0 | 121.0 | 27.6 | 147 | 17.687 | 147.0 | 26.0 | 147 | 121.0 | 10.020 | 9.420 | 53.261 | 10.160 | -0.740 | 0.940 |
| 1 | 2 | 1.0 | 1.0 | 1.00 | 147.0 | 17.0 | 130.0 | 27.6 | 129 | 11.565 | 294.0 | 43.0 | 276 | 251.0 | 20.041 | 15.580 | 100.000 | 21.075 | -5.495 | 0.777 |
| 2 | 3 | 1.0 | 1.0 | 1.00 | 147.0 | 31.0 | 116.0 | 27.6 | 0 | 21.088 | 441.0 | 74.0 | 276 | 367.0 | 30.061 | 26.812 | 100.000 | 30.814 | -4.002 | 0.892 |
| 3 | 4 | 1.0 | 1.0 | 1.00 | 146.0 | 28.0 | 118.0 | 27.6 | 0 | 19.178 | 587.0 | 102.0 | 276 | 485.0 | 40.014 | 36.957 | 100.000 | 40.722 | -3.765 | 0.924 |
| 4 | 5 | 1.0 | 1.0 | 1.00 | 147.0 | 27.0 | 120.0 | 27.6 | 0 | 18.367 | 734.0 | 129.0 | 276 | 605.0 | 50.034 | 46.739 | 100.000 | 50.798 | -4.059 | 0.934 |
| 5 | 6 | 1.0 | 1.0 | 1.00 | 147.0 | 35.0 | 112.0 | 27.6 | 0 | 23.810 | 881.0 | 164.0 | 276 | 717.0 | 60.055 | 59.420 | 100.000 | 60.202 | -0.782 | 0.989 |
| 6 | 7 | 1.0 | 1.0 | 1.00 | 146.0 | 18.0 | 128.0 | 27.6 | 0 | 12.329 | 1027.0 | 182.0 | 276 | 845.0 | 70.007 | 65.942 | 100.000 | 70.949 | -5.007 | 0.942 |
| 7 | 8 | 1.0 | 1.0 | 1.00 | 147.0 | 27.0 | 120.0 | 27.6 | 0 | 18.367 | 1174.0 | 209.0 | 276 | 965.0 | 80.027 | 75.725 | 100.000 | 81.024 | -5.299 | 0.946 |
| 8 | 9 | 1.0 | 1.0 | 1.00 | 147.0 | 38.0 | 109.0 | 27.6 | 0 | 25.850 | 1321.0 | 247.0 | 276 | 1074.0 | 90.048 | 89.493 | 100.000 | 90.176 | -0.683 | 0.994 |
| 9 | 10 | 0.0 | 1.0 | 0.87 | 146.0 | 29.0 | 117.0 | 27.6 | 0 | 19.863 | 1467.0 | 276.0 | 276 | 1191.0 | 100.000 | 100.000 | 100.000 | 100.000 | 0.000 | 1.000 |
Tuning Random Forest
# Choose the type of classifier.
rf_estimator = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    "n_estimators": [110, 251, 501],       # number of decision trees in the forest
    "min_samples_leaf": np.arange(1, 6),   # minimum number of samples required at a leaf node
    "max_features": [0.7, 0.9, 'log2', 'auto'],  # features considered at each split; default is sqrt(n_features) ('auto');
                                                 # a float f draws f * n_features features, so f must be in (0.0, 1.0]
    "max_samples": [0.7, 0.9, None],       # samples drawn from X to train each tree when bootstrap=True;
                                           # None (default) draws all X.shape[0] samples, a float f draws f * X.shape[0]
}
# Run the grid search
grid_obj = GridSearchCV(rf_estimator, parameters, scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(max_features=0.7, n_estimators=501, random_state=1)
confusion_matrix_sklearn(rf_estimator, X_test, y_test)
tuned_rf_acc = get_accuracy_score(rf_estimator)
tuned_rf_recall = get_recall_score(rf_estimator)
tuned_rf_precision = get_precision_score(rf_estimator)
Accuracy on training set : 1.0
Accuracy on test set : 0.9325153374233128
Recall on training set : 1.0
Recall on test set : 0.6992753623188406
Precision on training set : 1.0
Precision on test set : 0.9234449760765551
# note: kds expects class probabilities; rf_estimator.predict_proba(X_test)[:, 1]
# would give smoother gain curves than the hard 0/1 predictions used here
kds.metrics.plot_cumulative_gain(y_test, rf_estimator.predict(X_test))
kds.metrics.decile_table(y_test, rf_estimator.predict(X_test))
LABELS INFO:
prob_min : Minimum probability in a particular decile
prob_max : Maximum probability in a particular decile
prob_avg : Average probability in a particular decile
cnt_events : Count of events in a particular decile
cnt_resp : Count of responders in a particular decile
cnt_non_resp : Count of non-responders in a particular decile
cnt_resp_rndm : Count of responders if events assigned randomly in a particular decile
cnt_resp_wiz : Count of best possible responders in a particular decile
resp_rate : Response Rate in a particular decile [(cnt_resp/cnt_cust)*100]
cum_events : Cumulative sum of events decile-wise
cum_resp : Cumulative sum of responders decile-wise
cum_resp_wiz : Cumulative sum of best possible responders decile-wise
cum_non_resp : Cumulative sum of non-responders decile-wise
cum_events_pct : Cumulative sum of percentages of events decile-wise
cum_resp_pct : Cumulative sum of percentages of responders decile-wise
cum_resp_pct_wiz : Cumulative sum of percentages of best possible responders decile-wise
cum_non_resp_pct : Cumulative sum of percentages of non-responders decile-wise
KS : KS Statistic decile-wise
lift : Cumulative Lift Value decile-wise
| decile | prob_min | prob_max | prob_avg | cnt_cust | cnt_resp | cnt_non_resp | cnt_resp_rndm | cnt_resp_wiz | resp_rate | cum_cust | cum_resp | cum_resp_wiz | cum_non_resp | cum_cust_pct | cum_resp_pct | cum_resp_pct_wiz | cum_non_resp_pct | KS | lift | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 | 1.000 | 147.0 | 136.0 | 11.0 | 27.6 | 147 | 92.517 | 147.0 | 136.0 | 147 | 11.0 | 10.020 | 49.275 | 53.261 | 0.924 | 48.351 | 4.918 |
| 1 | 2 | 0.0 | 1.0 | 0.422 | 147.0 | 66.0 | 81.0 | 27.6 | 129 | 44.898 | 294.0 | 202.0 | 276 | 92.0 | 20.041 | 73.188 | 100.000 | 7.725 | 65.463 | 3.652 |
| 2 | 3 | 0.0 | 0.0 | 0.000 | 147.0 | 6.0 | 141.0 | 27.6 | 0 | 4.082 | 441.0 | 208.0 | 276 | 233.0 | 30.061 | 75.362 | 100.000 | 19.563 | 55.799 | 2.507 |
| 3 | 4 | 0.0 | 0.0 | 0.000 | 146.0 | 15.0 | 131.0 | 27.6 | 0 | 10.274 | 587.0 | 223.0 | 276 | 364.0 | 40.014 | 80.797 | 100.000 | 30.563 | 50.234 | 2.019 |
| 4 | 5 | 0.0 | 0.0 | 0.000 | 147.0 | 7.0 | 140.0 | 27.6 | 0 | 4.762 | 734.0 | 230.0 | 276 | 504.0 | 50.034 | 83.333 | 100.000 | 42.317 | 41.016 | 1.666 |
| 5 | 6 | 0.0 | 0.0 | 0.000 | 147.0 | 5.0 | 142.0 | 27.6 | 0 | 3.401 | 881.0 | 235.0 | 276 | 646.0 | 60.055 | 85.145 | 100.000 | 54.240 | 30.905 | 1.418 |
| 6 | 7 | 0.0 | 0.0 | 0.000 | 146.0 | 11.0 | 135.0 | 27.6 | 0 | 7.534 | 1027.0 | 246.0 | 276 | 781.0 | 70.007 | 89.130 | 100.000 | 65.575 | 23.555 | 1.273 |
| 7 | 8 | 0.0 | 0.0 | 0.000 | 147.0 | 9.0 | 138.0 | 27.6 | 0 | 6.122 | 1174.0 | 255.0 | 276 | 919.0 | 80.027 | 92.391 | 100.000 | 77.162 | 15.229 | 1.154 |
| 8 | 9 | 0.0 | 0.0 | 0.000 | 147.0 | 12.0 | 135.0 | 27.6 | 0 | 8.163 | 1321.0 | 267.0 | 276 | 1054.0 | 90.048 | 96.739 | 100.000 | 88.497 | 8.242 | 1.074 |
| 9 | 10 | 0.0 | 0.0 | 0.000 | 146.0 | 9.0 | 137.0 | 27.6 | 0 | 6.164 | 1467.0 | 276.0 | 276 | 1191.0 | 100.000 | 100.000 | 100.000 | 100.000 | 0.000 | 1.000 |
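The decile-wise response rate, cumulative percentages, and lift in the table above can be reproduced by hand. A minimal self-contained sketch on toy scores (not the model's actual predictions) showing the arithmetic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# toy data: 1000 customers, ~19% responders, scores loosely correlated with response
y = rng.binomial(1, 0.19, size=1000)
scores = np.clip(y * 0.4 + rng.normal(0.3, 0.2, size=1000), 0, 1)

df = pd.DataFrame({"y": y, "score": scores})
# rank customers by score and cut into 10 equal-sized deciles (1 = highest scores)
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False),
                       10, labels=False) + 1

grouped = df.groupby("decile")["y"].agg(cnt_cust="count", cnt_resp="sum")
grouped["resp_rate"] = 100 * grouped["cnt_resp"] / grouped["cnt_cust"]
grouped["cum_resp_pct"] = 100 * grouped["cnt_resp"].cumsum() / df["y"].sum()
grouped["cum_cust_pct"] = 100 * grouped["cnt_cust"].cumsum() / len(df)
# lift = share of responders captured so far / share of customers contacted
grouped["lift"] = grouped["cum_resp_pct"] / grouped["cum_cust_pct"]
print(grouped)
```

By construction, lift in the last decile is always 1.0 (contacting everyone captures everyone), which matches the bottom row of the kds table.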
# defining list of models
models = [dtree,dtree_estimator,bagging,bagging_wt,bagging_estimator,rf,rf_wt,rf_estimator]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the accuracy,recall and precision scores
for model in models:
# accuracy score
j = get_accuracy_score(model,False)
acc_train.append(j[0])
acc_test.append(j[1])
# recall score
k = get_recall_score(model,False)
recall_train.append(k[0])
recall_test.append(k[1])
# precision score
l = get_precision_score(model,False)
precision_train.append(l[0])
precision_test.append(l[1])
comparison_frame = pd.DataFrame({'Model':['Decision Tree','Tuned Decision Tree','Bagging Classifier',
'Weighted Bagging Classifier','Tuned Bagging Classifier',
'Random Forest','Weighted Random Forest','Tuned Random Forest'],
'Train_Accuracy': acc_train,
'Test_Accuracy': acc_test,
'Train_Recall': recall_train,
'Test_Recall': recall_test,
'Train_Precision': precision_train,
'Test_Precision': precision_test
})
comparison_frame
| Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.000000 | 0.871847 | 1.000000 | 0.677536 | 1.000000 | 0.653846 |
| 1 | Tuned Decision Tree | 0.188249 | 0.188139 | 1.000000 | 1.000000 | 0.188249 | 0.188139 |
| 2 | Bagging Classifier | 0.994738 | 0.905930 | 0.973602 | 0.594203 | 0.998408 | 0.863158 |
| 3 | Weighted Bagging Classifier | 0.994446 | 0.903204 | 0.973602 | 0.547101 | 0.996820 | 0.898810 |
| 4 | Tuned Bagging Classifier | 0.209880 | 0.201091 | 1.000000 | 1.000000 | 0.192411 | 0.190608 |
| 5 | Random Forest | 1.000000 | 0.914792 | 1.000000 | 0.583333 | 1.000000 | 0.941520 |
| 6 | Weighted Random Forest | 1.000000 | 0.905249 | 1.000000 | 0.528986 | 1.000000 | 0.941935 |
| 7 | Tuned Random Forest | 1.000000 | 0.932515 | 1.000000 | 0.699275 | 1.000000 | 0.923445 |
# Feature importances from the forest: the importance of a feature is the (normalized)
# total reduction of the split criterion brought by that feature (the Gini importance)
print (pd.DataFrame(rf.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp
logIncome 0.130667
Age 0.120616
DurationOfPitch 0.101149
Passport 0.068964
NumberOfTrips 0.064980
PitchSatisfactionScore 0.056532
NumberOfFollowups 0.054536
PreferredPropertyStar 0.042779
ProductPitched 0.038002
CityTier 0.037960
Designation 0.037147
NumberOfChildrenVisiting 0.032288
NumberOfPersonVisiting 0.028921
MaritalStatus_Single 0.028156
Gender_Male 0.025276
TypeofContact_Self Enquiry 0.024929
OwnCar 0.021368
Occupation_Small Business 0.018193
Occupation_Salaried 0.018184
MaritalStatus_Married 0.017668
MaritalStatus_Unmarried 0.017088
Occupation_Large Business 0.014598
feature_names = X_train.columns
importances = rf_estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
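Impurity-based importances like those plotted above can overstate continuous or high-cardinality features. As a cross-check, scikit-learn's `permutation_importance` measures how much a chosen score drops when each feature is shuffled on held-out data. A minimal sketch on synthetic data (the variables below are illustrative, not this notebook's dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic binary-classification data: 5 informative features out of 10
X_syn, y_syn = make_classification(n_samples=1000, n_features=10,
                                   n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3,
                                          random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
# shuffle each feature 10 times on the held-out set and record the recall drop
result = permutation_importance(model, X_te, y_te, scoring="recall",
                                n_repeats=10, random_state=1)
for i in np.argsort(result.importances_mean)[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

Features whose shuffling barely changes recall contribute little to the model, even if their impurity-based importance looks sizeable.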
Recall - It gives the ratio of True Positives to Actual Positives, so high recall implies low false negatives, i.e. a low chance of predicting a buyer as a non-buyer.
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train,y_train)
AdaBoostClassifier(random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
abc_score=get_metrics_score(abc)
Accuracy on training set : 0.8459514761765565
Accuracy on test set : 0.8445807770961146
Recall on training set : 0.3167701863354037
Recall on test set : 0.32971014492753625
Precision on training set : 0.7010309278350515
Precision on test set : 0.6791044776119403
make_confusion_matrix(abc,y_test)
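As the note above says, recall = TP / (TP + FN). A quick self-contained check (toy labels, not this notebook's data) that `recall_score` matches the confusion-matrix arithmetic:

```python
from sklearn.metrics import confusion_matrix, recall_score

# toy labels: 1 = buyer, 0 = non-buyer
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # recall from the matrix → 0.6
print(recall_score(y_true, y_pred))  # same value from sklearn → 0.6
```

Here 3 of the 5 actual buyers are caught (tp=3, fn=2), so recall is 0.6; the two false negatives are buyers the campaign would miss.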
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train,y_train)
GradientBoostingClassifier(random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
gbc_score=get_metrics_score(gbc)
Accuracy on training set : 0.8839520608009354
Accuracy on test set : 0.8629856850715747
Recall on training set : 0.4503105590062112
Recall on test set : 0.3804347826086957
Precision on training set : 0.8708708708708709
Precision on test set : 0.7777777777777778
make_confusion_matrix(gbc,y_test)
xgb = XGBClassifier(random_state=1,eval_metric='logloss')
xgb.fit(X_train,y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
#Using above defined function to get accuracy, recall and precision on train and test set
xgb_score=get_metrics_score(xgb)
Accuracy on training set : 0.9997076878105817
Accuracy on test set : 0.9250170415814588
Recall on training set : 0.9984472049689441
Recall on test set : 0.6956521739130435
Precision on training set : 1.0
Precision on test set : 0.8807339449541285
make_confusion_matrix(xgb,y_test)
Tuning AdaBoost
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1, random_state=1),DecisionTreeClassifier(max_depth=2, random_state=1),DecisionTreeClassifier(max_depth=3, random_state=1)],
"n_estimators": np.arange(10,110,10),
"learning_rate":np.arange(0.1,2,0.1)
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=1.2000000000000002, n_estimators=100,
random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
abc_tuned_score=get_metrics_score(abc_tuned)
Accuracy on training set : 0.9912306343174511
Accuracy on test set : 0.8773006134969326
Recall on training set : 0.9627329192546584
Recall on test set : 0.6159420289855072
Precision on training set : 0.9904153354632588
Precision on test set : 0.6967213114754098
make_confusion_matrix(abc_tuned,y_test)
importances = abc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Let's try using AdaBoost classifier as the estimator for initial predictions
gbc_init = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
gbc_init.fit(X_train,y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
gbc_init_score=get_metrics_score(gbc_init)
Accuracy on training set : 0.8848289973691903
Accuracy on test set : 0.8657123381049762
Recall on training set : 0.45652173913043476
Recall on test set : 0.391304347826087
Precision on training set : 0.8698224852071006
Precision on test set : 0.7883211678832117
Compared to the Gradient Boosting model with default parameters, using AdaBoost for the initial predictions gives a marginal improvement in test recall and precision. Let's now tune this model:
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.7, n_estimators=250, random_state=1,
subsample=0.9)
#Using above defined function to get accuracy, recall and precision on train and test set
gbc_tuned_score=get_metrics_score(gbc_tuned)
Accuracy on training set : 0.9190295235311312
Accuracy on test set : 0.8766189502385822
Recall on training set : 0.6118012422360248
Recall on test set : 0.4601449275362319
Precision on training set : 0.9358669833729216
Precision on test set : 0.7987421383647799
make_confusion_matrix(gbc_tuned,y_test)
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
XGBoost has many hyperparameters that can be tuned to improve model performance. The grid below searches over several important ones: n_estimators, scale_pos_weight, subsample, learning_rate, gamma, colsample_bytree, and colsample_bylevel.
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1,eval_metric='logloss')
# Grid of parameters to choose from
parameters = {
"n_estimators": np.arange(10,100,20),
"scale_pos_weight":[0,1,2,5],
"subsample":[0.5,0.7,0.9,1],
"learning_rate":[0.01,0.1,0.2,0.05],
"gamma":[0,1,3],
"colsample_bytree":[0.5,0.7,0.9,1],
"colsample_bylevel":[0.5,0.7,0.9,1]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters,scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.9, eval_metric='logloss',
gamma=1, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.1, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=70, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=5, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
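The grid above has 5 × 4 × 4 × 4 × 3 × 4 × 4 = 15,360 parameter combinations, i.e. 76,800 fits with 5-fold CV. When an exhaustive search is this expensive, `RandomizedSearchCV` samples a fixed budget of combinations from the same grid. A hedged sketch of the API on synthetic data, using scikit-learn's GradientBoostingClassifier so the snippet has no xgboost dependency (the budget `n_iter=10` is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X_syn, y_syn = make_classification(n_samples=300, random_state=1)

param_distributions = {
    "n_estimators": np.arange(50, 200, 50),   # 50, 100, 150
    "subsample": [0.7, 0.8, 0.9, 1.0],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}
# sample 10 of the 3*4*4*3 = 144 combinations instead of trying all of them
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=1),
                            param_distributions, n_iter=10, scoring="recall",
                            cv=3, random_state=1)
search.fit(X_syn, y_syn)
print(search.best_params_)
```

The interface mirrors GridSearchCV (`best_estimator_`, `best_params_`, `best_score_`), so swapping it into the cells above only changes the search strategy, not the downstream code.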
#Using above defined function to get accuracy, recall and precision on train and test set
xgb_tuned_score=get_metrics_score(xgb_tuned)
Accuracy on training set : 0.9514761765565624
Accuracy on test set : 0.8841172460804363
Recall on training set : 0.984472049689441
Recall on test set : 0.8188405797101449
Precision on training set : 0.8025316455696202
Precision on test set : 0.653179190751445
make_confusion_matrix(xgb_tuned,y_test)
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# defining list of models
models = [abc, abc_tuned, gbc, gbc_init, gbc_tuned, xgb, xgb_tuned]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the accuracy, recall and precision scores
for model in models:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
comparison_frame = pd.DataFrame({'Model':['AdaBoost with default parameters','AdaBoost Tuned',
'Gradient Boosting with default parameters','Gradient Boosting with init=AdaBoost',
'Gradient Boosting Tuned','XGBoost with default parameters','XGBoost Tuned'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test})
comparison_frame
| Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | |
|---|---|---|---|---|---|---|---|
| 0 | AdaBoost with default parameters | 0.85 | 0.84 | 0.32 | 0.33 | 0.70 | 0.68 |
| 1 | AdaBoost Tuned | 0.99 | 0.88 | 0.96 | 0.62 | 0.99 | 0.70 |
| 2 | Gradient Boosting with default parameters | 0.88 | 0.86 | 0.45 | 0.38 | 0.87 | 0.78 |
| 3 | Gradient Boosting with init=AdaBoost | 0.88 | 0.87 | 0.46 | 0.39 | 0.87 | 0.79 |
| 4 | Gradient Boosting Tuned | 0.92 | 0.88 | 0.61 | 0.46 | 0.94 | 0.80 |
| 5 | XGBoost with default parameters | 1.00 | 0.93 | 1.00 | 0.70 | 1.00 | 0.88 |
| 6 | XGBoost Tuned | 0.95 | 0.88 | 0.98 | 0.82 | 0.80 | 0.65 |
We have been able to build a predictive model:
a) that the company can deploy to identify potential buyers who are likely to buy the new package.
b) that the company can use to find the drivers of the purchase decision.
c) based on which the company can build better promotion schemes.
The optimal models were found to be "Tuned Random Forest" (test precision = 92%, test recall = 70%) and "XGBoost Tuned" (test precision = 65%, test recall = 82%) for the Bagging and Boosting approaches, respectively.